YouTube Project: Analyzing freeCodeCamp Comments¶
Overview¶
This project focuses on analyzing comments from freeCodeCamp's YouTube channel to uncover insights about viewer sentiment and key topics of discussion. By leveraging the YouTube Data API, I gathered video and comment data, which was then preprocessed to remove noise such as punctuation, stopwords, and emojis. I used sentiment analysis with a BERTweet model to classify the sentiment of each comment, and applied BERTopic for topic modeling to identify recurring themes across the comments.
The project aims to provide a deeper understanding of how viewers interact with freeCodeCamp content, what topics are of interest to them, and how they feel about the videos. This aligns with the concept of social listening, which involves monitoring and analyzing online conversations to gain insights into public opinion, identify trends, and understand the emotional tone surrounding specific topics or brands.
What is Social Listening?¶
Social listening refers to the process of monitoring digital conversations to understand what is being said about a brand, product, or topic on social media platforms and other online spaces. It involves gathering data from sources like social media comments, blog posts, and forums, and analyzing it to gain actionable insights. Social listening can help identify consumer sentiments, emerging trends, and feedback that can inform marketing strategies, product development, and customer engagement efforts.
In the context of this project, social listening is applied by analyzing YouTube comments on freeCodeCamp’s videos. The sentiment and topic modeling results provide valuable feedback on how the audience is responding to the content. By understanding these dynamics, I can assess the effectiveness of freeCodeCamp's educational videos and recognize areas for improvement or growth.
It is important to note that this analysis is based on a small sample of data due to API limits. The data collected represents only a subset of the videos and comments from freeCodeCamp’s YouTube channel. This sample was extracted to demonstrate the process, but a more comprehensive analysis could be performed with broader data access.
#Import modules
from googleapiclient.errors import HttpError
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import emoji
from transformers import pipeline
from bertopic import BERTopic
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import string
from PIL import Image
from googleapiclient.discovery import build
import time
import random
import plotly.express as px
from collections import Counter
from IPython.display import display, Markdown, HTML
import plotly.io as pio
pio.renderers.default = "notebook"
# Download NLTK resources (run only once)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
Fetch data from API¶
# Set up YouTube API credentials (insert your own API key)
api_key = ""
youtube = build("youtube", "v3", developerKey=api_key)
# Define the channel ID for freeCodeCamp
channel_id = "UC8butISFwT-Wl7EV0hUK0BQ" # freeCodeCamp.org YouTube Channel ID
# Function to fetch top-level comments for a video
def fetch_comments(video_id, max_comments=500):
    comments = []
    next_page_token = None
    comment_count = 0
    while comment_count < max_comments:
        try:
            request = youtube.commentThreads().list(
                part="snippet", videoId=video_id, textFormat="plainText", maxResults=100, pageToken=next_page_token
            )
            response = request.execute()
            for item in response['items']:
                comment_data = {
                    'Comment': item['snippet']['topLevelComment']['snippet']['textDisplay'],
                    'Author': item['snippet']['topLevelComment']['snippet']['authorDisplayName'],
                    'Comment ID': item['snippet']['topLevelComment']['id'],
                    'Published At': item['snippet']['topLevelComment']['snippet']['publishedAt']
                }
                comments.append(comment_data)
                comment_count += 1
                if comment_count >= max_comments:
                    break
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break
            time.sleep(random.uniform(1, 3))  # Add random delay to reduce API load
        except HttpError as e:
            if e.resp.status == 403:  # Quota exceeded error
                print("Quota exceeded while fetching comments. Stopping the data collection.")
                return pd.DataFrame(comments)  # Return whatever has been collected so far
            else:
                print(f"Error fetching comments for video {video_id}: {e}")
                time.sleep(10)  # If an error occurs, wait before retrying
    return pd.DataFrame(comments)
def fetch_videos(channel_id, max_videos=500):
    videos = []
    request = youtube.channels().list(part="contentDetails", id=channel_id)
    response = request.execute()
    playlist_id = response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    next_page_token = None
    video_count = 0
    while video_count < max_videos:
        try:
            request = youtube.playlistItems().list(
                part="snippet", playlistId=playlist_id, maxResults=50, pageToken=next_page_token
            )
            response = request.execute()
            for item in response['items']:
                video_data = {
                    'Video Title': item['snippet']['title'],
                    'Video ID': item['snippet']['resourceId']['videoId'],
                    'Published At': item['snippet']['publishedAt'],
                    'Description': item['snippet']['description'],
                    'Video URL': f"https://www.youtube.com/watch?v={item['snippet']['resourceId']['videoId']}"
                }
                videos.append(video_data)
                video_count += 1
                if video_count >= max_videos:
                    break
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break
            time.sleep(random.uniform(1, 3))  # Avoid exceeding rate limits
        except HttpError as e:
            if e.resp.status == 403:  # Quota exceeded error
                print("Quota exceeded while fetching videos. Stopping the data collection.")
                return pd.DataFrame(videos)  # Return whatever has been collected so far
            else:
                print(f"Error fetching videos: {e}")
                time.sleep(10)  # If an error occurs, wait before retrying
    return pd.DataFrame(videos)
# Collect video data (limit to 500 videos for quota safety)
videos_df = fetch_videos(channel_id, max_videos=500)
print(f"Collected {len(videos_df)} videos.")
# Collect comments for each video (limit to 500 comments per video)
comments_df = pd.DataFrame(columns=['Comment', 'Author', 'Comment ID', 'Published At', 'Video ID'])
for video_id in videos_df['Video ID']:
    print(f"Fetching comments for video {video_id}")
    video_comments_df = fetch_comments(video_id, max_comments=500)
    if not video_comments_df.empty:
        video_comments_df['Video ID'] = video_id
        comments_df = pd.concat([comments_df, video_comments_df], ignore_index=True)
    # Save collected data at each step to avoid losing progress
    videos_df.to_csv("freecodecamp_videos.csv", index=False)
    comments_df.to_csv("freecodecamp_comments.csv", index=False)
    # Stop if no comments came back (e.g., the API quota was exceeded)
    if video_comments_df.empty:
        break
print(f"Collected {len(comments_df)} comments.")
Data Cleaning¶
comments_df = pd.read_csv('freecodecamp_comments.csv')
videos_df = pd.read_csv('freecodecamp_videos.csv')
comments_df.head()
videos_df.head()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    """
    Preprocesses text by removing punctuation, converting to lowercase, and removing stopwords.

    Args:
        text (str): The text to preprocess.

    Returns:
        str: The preprocessed text.
    """
    if pd.isna(text):
        return None
    # Remove punctuation and convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Remove stopwords
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text
# Apply preprocessing to the 'Comment' column
comments_df['Processed_Text'] = comments_df['Comment'].apply(preprocess_text)
def translate_emojis(text):
    """
    Translates emojis in text to their textual descriptions.

    Args:
        text (str): The text containing emojis.

    Returns:
        str: The text with emojis translated.
    """
    if pd.isna(text):
        return None
    return emoji.demojize(text)
# Apply demojize to the preprocessed comments
comments_df['Comment_no_emojis'] = comments_df['Processed_Text'].apply(translate_emojis)
Missing values¶
videos_df.isnull().sum()
comments_df.isnull().sum()
#Remove the rows with missing comments
comments_df = comments_df.dropna()
#Extract date
comments_df['Published_date'] = pd.to_datetime(comments_df['Published At']).dt.date
videos_df['Published_date'] = pd.to_datetime(videos_df['Published At']).dt.date
Descriptive Statistics¶
plt.figure(figsize=(12, 6))
sns.histplot(comments_df['Published_date'], bins=50, kde=True)
plt.xlabel("Published Date")
plt.ylabel("Number of Comments")
plt.title("Distribution of YouTube Comments Over Time")
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(12, 6))
sns.histplot(comments_df.Comment.apply(len), bins=200, kde=True)
plt.xlabel("Character Length")
plt.ylabel("Number of Comments")
plt.title("Distribution of Comment Character Length")
plt.xticks(rotation=45)
plt.show()
comments_df.Comment.apply(len).describe()#Comment length
def wc_preprocess_text(text):
    # Remove punctuation and convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Lemmatize the tokens, skipping purely numeric ones
    tokens = [lemmatizer.lemmatize(word) for word in tokens if not word.isnumeric()]
    return tokens
comments_df['processed_text_tokens'] = comments_df['Processed_Text'].apply(wc_preprocess_text)
# Flatten the list of tokens
all_words = [word for tokens in comments_df['processed_text_tokens'] for word in tokens]
# Count word occurrences
word_counts = Counter(all_words)
# Get the top 10 most common words
top_10_words = word_counts.most_common(10)
display(Markdown(f"##### Key Metrics"))
display(Markdown(f"* Total Comments Analyzed: {len(comments_df)} for {len(videos_df)} videos, which is equal to around {round(len(comments_df)/len(videos_df),2)} comments per video"))
display(Markdown(f"* Date Range of Comments: {comments_df['Published_date'].min()} to {comments_df['Published_date'].max()}"))
display(Markdown(f"* Unique Users: {len(comments_df['Author'].unique())}"))
display(Markdown(f"* Average Comment Length: {comments_df.Comment.apply(len).mean().astype(int)} characters"))
display(Markdown(f"* Most Frequent Words:"))
display(pd.DataFrame(top_10_words,columns=['Word','Count']))
Conclusion¶
The descriptive statistics provide a quantitative overview of the dataset, highlighting engagement levels and common patterns in comment length and vocabulary usage.
Word Cloud¶
def bag_of_words_tokens_to_string(tokens):
    # Flatten the list of token lists and join items with a space
    return ' '.join([item for sublist in tokens for item in sublist])
comment_mask = np.array(Image.open('./comment_mask.png'))
def plot_bag_of_words(data=comments_df, sentiment='ALL', colormap='BuPu_r'):
    # Build the bag-of-words string for the requested sentiment
    if sentiment == 'ALL':
        bag_of_words_string = bag_of_words_tokens_to_string(list(data['processed_text_tokens']))
    else:
        sent_tokens_dict = {'POS': list(data[data['Sentiment_label'] == 'POS']['processed_text_tokens']),
                            'NEG': list(data[data['Sentiment_label'] == 'NEG']['processed_text_tokens']),
                            'NEU': list(data[data['Sentiment_label'] == 'NEU']['processed_text_tokens'])}
        bag_of_words_string = bag_of_words_tokens_to_string(sent_tokens_dict[sentiment])
    # Generate the word cloud
    wordcloud = WordCloud(background_color='white', mask=comment_mask, contour_width=2,
                          contour_color='black', colormap=colormap, width=800, height=500).generate(bag_of_words_string)
    plt.imshow(wordcloud)
    plt.axis("off")
    return wordcloud
all_cloud = plot_bag_of_words()
all_cloud.to_file("wordcloud_all.png")
Sentiment analysis¶
Using bertweet-base-sentiment-analysis helps classify comments as positive (POS), negative (NEG), or neutral (NEU) (Pérez et al., 2021).
Why Use This Model?¶
- Trained on social media - Handles slang, emojis, and informal text.
- Easy to use with Hugging Face Pipelines.
sentiment_pipeline = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
Handling Emojis in Sentiment Analysis¶
Emojis can impact sentiment classification. Let's compare how the model classifies text with emojis vs. after emoji translation.
def detect_emojis(text):
    emoji_list = [char for char in text if emoji.is_emoji(char)]
    return len(emoji_list) > 0
text_with_emoji_list = [text for text in comments_df['Processed_Text'].astype(str) if detect_emojis(text)]
print(f'{len(text_with_emoji_list)} ({round(len(text_with_emoji_list)/len(comments_df)*100, 2)}%) comments contain at least one emoji')
sentiment_without_emoji = []
sentiment_with_emoji = []
comment = []
for idx, row in comments_df.iterrows():
    text_with_emojis = row['Processed_Text']
    text_without_emojis = row['Comment_no_emojis']
    if detect_emojis(text_with_emojis):
        sent_without = sentiment_pipeline([text_without_emojis], truncation=True)
        sent_with = sentiment_pipeline([text_with_emojis], truncation=True)
        sentiment_without_emoji.append(sent_without[0]['label'])
        sentiment_with_emoji.append(sent_with[0]['label'])
        comment.append(row['Comment'])
emoji_sent_comapre_df = pd.DataFrame({'Comment':comment,'Sentiment_without_emoji':sentiment_without_emoji,'Sentiment_with_emoji':sentiment_with_emoji})
#Plot the sentiment distributions for texts with emojis versus those without emojis.
plt.figure(figsize=(10,5))
sns.histplot(emoji_sent_comapre_df["Sentiment_with_emoji"], label="With Emoji", color="blue", alpha=0.6)
sns.histplot(emoji_sent_comapre_df["Sentiment_without_emoji"], label="Translated", color="red", alpha=0.6)
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution: Emoji vs. Translated")
plt.legend()
plt.show()
# Get a clear view of the sentiment transitions between with and without emojis
transition_counts = {
    'NEU': {'POS': 0, 'NEG': 0},
    'POS': {'NEG': 0, 'NEU': 0},
    'NEG': {'POS': 0, 'NEU': 0}
}
for idx, row in emoji_sent_comapre_df.iterrows():
    if row['Sentiment_with_emoji'] != row['Sentiment_without_emoji']:
        transition_counts[row['Sentiment_with_emoji']][row['Sentiment_without_emoji']] += 1
# 'transition_counts[a][b]' holds how many comments were labeled 'a' with the raw emoji
# but 'b' after emoji translation. For example, transition_counts['POS']['NEG'] counts
# comments that flipped from positive (with emoji) to negative (translated).
transition_counts
# Let's look at some examples of sentiment transitions
emoji_non_emoji_sentiment_change = emoji_sent_comapre_df[emoji_sent_comapre_df['Sentiment_without_emoji'] != emoji_sent_comapre_df['Sentiment_with_emoji']].sample(n=20)
for idx, row in emoji_non_emoji_sentiment_change.iterrows():
    print(f"Comment {idx}: {row['Comment']}")
    print(f'Sentiment without emoji: {row["Sentiment_without_emoji"]}')
    print(f'Sentiment with emoji: {row["Sentiment_with_emoji"]}')
The following cases illustrate why translating emojis, rather than leaving them as-is or removing them, is beneficial:
The comment "Deepseek gonna takeover everything soon 😢" transitioned from Neutral to Negative after the emoji was translated.
Similarly, "Hi, I started this course about a week ago and I am having a bit of trouble with the SQL intermediate phase and would really appreciate some assistance 😢" shifted from Neutral to Negative once the emoji was translated.
Some comments containing only an emoji, like "❤," seemed to transition from Neutral to Positive/Negative after the emoji was translated.
It appears that translating the emojis improved performance, with most changes occurring from Neutral to either Positive or Negative. This suggests that translating the emoji helps the model correctly understand the context of comments with emojis.
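To make the mechanism concrete, here is a minimal, self-contained sketch of the translation step using a tiny hand-rolled emoji mapping (illustrative only; the notebook itself relies on emoji.demojize for the full mapping):

```python
# Minimal sketch: a hand-rolled emoji-to-text mapping (illustrative only;
# this notebook uses emoji.demojize for the complete CLDR mapping)
EMOJI_WORDS = {"😢": "crying face", "❤": "red heart", "😀": "grinning face"}

def translate_emojis_simple(text):
    """Replace known emojis with their textual descriptions."""
    for symbol, words in EMOJI_WORDS.items():
        text = text.replace(symbol, f" {words} ")
    return " ".join(text.split())  # normalize whitespace

print(translate_emojis_simple("Deepseek gonna takeover everything soon 😢"))
```

A classifier that never saw the raw codepoint 😢 during training can now pick up the words "crying face" as a negative cue, which is exactly the shift observed in the examples above.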
non_emoji_docs = list(comments_df.Comment_no_emojis)
#Long texts will be truncated
comments_df['Sentiment'] = sentiment_pipeline(non_emoji_docs,truncation=True)
comments_df['Sentiment_label'] = comments_df['Sentiment'].apply(lambda x: x['label'])
comments_df['Sentiment_score'] = comments_df['Sentiment'].apply(lambda x: x['score'])
pio.renderers.default = "notebook"
sent_freq = comments_df['Sentiment_label'].value_counts().reset_index()
sent_freq['p'] = sent_freq['count']/sent_freq['count'].sum()
fig = px.pie(sent_freq, values='p', names='Sentiment_label', title='Percentage of comments by sentiment category')
fig.show()
Let's look at some examples of each sentiment
# Define sentiments
sentiments = {'POS':'Greens', 'NEG':'Reds', 'NEU':'spring'}
for sentiment, colormap in sentiments.items():
    display(Markdown(f"## {sentiment}: Example documents"))
    random_examples = comments_df[comments_df['Sentiment_label'] == sentiment].sample(n=10)['Comment'].values
    for idx, comment in enumerate(random_examples):
        print(f'Example {idx}: {comment}')
    display(Markdown(f"## {sentiment}: Word cloud"))
    cloud = plot_bag_of_words(sentiment=sentiment, colormap=colormap)
    cloud.to_file(f"wordcloud_{sentiment}.png")
Distribution of YouTube Comments Over Time Grouped by Sentiment¶
plt.figure(figsize=(12, 6))
sns.histplot(data = comments_df,x='Published_date',hue='Sentiment_label', bins=50, kde=True)
plt.xlabel("Published Date")
plt.ylabel("Number of Comments")
plt.title("Distribution of YouTube Comments Over Time grouped by sentiment")
plt.xticks(rotation=45)
plt.show()
display(Markdown(f"##### Key Metrics"))
for sentiment in sentiments.keys():
    sentiment_len = len(comments_df[comments_df['Sentiment_label'] == sentiment])
    total_len = len(comments_df)
    display(Markdown(f"* {sentiment} Sentiments: {sentiment_len} ({round(sentiment_len/total_len*100, 2)}%) comments"))
Sentiment analysis Conclusion¶
The sentiment analysis shows that most comments are neutral, with positive sentiment comprising a substantial portion and negative sentiment highlighting areas for potential improvement.
Topic Modeling (BERTopic)¶
To extract meaningful themes, BERTopic was applied separately to each sentiment category:
Positive Comments: Identifying themes in praises, appreciation, and positive feedback.
Negative Comments: Highlighting criticisms, complaints, and areas for improvement.
Neutral Comments: Extracting general discussion topics without strong sentiment.
sent_model_result = {'POS': {}, 'NEG': {}, 'NEU': {}}
for sentiment in ['POS', 'NEG', 'NEU']:
    print(f'Processing {sentiment} comments')
    print(f'Number of comments: {len(comments_df[comments_df.Sentiment_label == sentiment])}')
    non_emoji_docs_sent = list(comments_df.Comment_no_emojis[comments_df.Sentiment_label == sentiment].astype(str))
    topic_model = BERTopic(nr_topics=11)
    topics, probs = topic_model.fit_transform(non_emoji_docs_sent)
    sent_model_result[sentiment] = {'topics': topics, 'probs': probs, 'topic_model': topic_model}
For each sentiment, after generating the topics and their probabilities, I can:
Access the most frequent topics that were generated.
Investigate the relationships between topics.
Notes:
Topic -1 collects all outliers and should typically be ignored.
I set nr_topics=11 (10 interpretable topics plus the outlier topic) to improve interpretability and ensure meaningful topic extraction with the BERTopic model.
# Define sentiments
sentiments = ['POS', 'NEG', 'NEU']
topic_summary_df = pd.DataFrame()
for sentiment in sentiments:
    display(Markdown(f"## {sentiment}: Sentiment Analysis"))
    topic_model = sent_model_result[sentiment]['topic_model']
    # Get topics sorted by frequency
    topic_info_sorted = topic_model.get_topic_info().sort_values(by='Count', ascending=False)
    sorted_topics = list(topic_info_sorted['Topic'])
    # Remove outlier topic (-1)
    if -1 in sorted_topics:
        sorted_topics.remove(-1)
    display(topic_info_sorted)  # Display DataFrame in Jupyter
    # Retrieve the top 2 topics based on their frequency (count)
    top_2_topics = topic_info_sorted[topic_info_sorted['Topic'].isin(sorted_topics)].reset_index(drop=True).loc[:1]
    # Get the top keywords representing the selected topics
    top_keywords = top_2_topics['Representation']
    # Calculate the percentage of comments for the selected topics
    percentage_of_comments = round(top_2_topics['Count'] / topic_info_sorted['Count'].sum() * 100, 2)
    # Convert to string and add '%' suffix
    percentage_of_comments = percentage_of_comments.astype(str) + '%'
    # Append this sentiment's top keywords and comment percentages to the summary
    sent_topic_summary_df = pd.DataFrame({'Sentiment': sentiment, 'Top Keywords': top_keywords,
                                          'Percentage of Comments': percentage_of_comments})
    topic_summary_df = pd.concat([topic_summary_df, sent_topic_summary_df])
    # Generate topic bar chart
    display(Markdown("### Top Words per Topic"))
    fig = topic_model.visualize_barchart(topics=sorted_topics)
    fig.show()
    # Generate topic visualization
    display(Markdown("### Topic Clustering Visualization"))
    try:
        fig = topic_model.visualize_topics()
        fig.write_html(f"{sentiment.lower()}_2d.html")
        fig.show()
    except Exception:
        print('Not enough topics for 2D visualization')
    # Generate heatmap
    display(Markdown("### Topic Similarity Heatmap"))
    try:
        fig = topic_model.visualize_heatmap()
        fig.show()
    except Exception:
        print('Not enough topics for heatmap')
Topic modeling overview¶
Key Metrics: Top Topic Extracted from Each Sentiment¶
Positive Sentiment (POS):
Top Keywords: "course", "thank", "video", "great", "thanks"
Share of positive-sentiment comments in the top topic: 34.31%
Dominated by expressions of thanks and appreciation, with some mentions of specific technical topics like "tutorial" and "react."
Negative Sentiment (NEG):
Top Keywords: "video", "accent", "english", "understand", "ads"
Share of negative-sentiment comments in the top topic: 16.59%
Negative comments reflect difficulty understanding accents, video-quality issues, and ads.
Neutral Sentiment (NEU):
Top Keywords: "course", "video", "code", "ai", "please"
Share of neutral-sentiment comments in the top topic: 26.48%
Neutral comments cover general discussion of the courses and coding, with thanks and requests for clarification.
Conclusion¶
The topic modeling results provide valuable insights into audience sentiment and engagement with the content. Positive comments are primarily driven by gratitude and appreciation for the courses, with some focus on specific technical topics. Negative comments highlight areas for potential improvement, particularly language barriers, video quality, and ad disruptions. Neutral comments reflect general course discussions, coding-related topics, and user requests for clarification.
These findings suggest that while the content is well-received overall, addressing concerns about accessibility and video experience could enhance viewer satisfaction.
Write the results to CSV
comments_df.to_csv('comments_df.csv')
Final Conclusion¶
By analyzing the topics and sentiment distributions, I gained insights into what drives positive engagement (helpful tutorials and positive feedback), where users are facing challenges (e.g., accents and technical errors), and what general discussions are occurring (e.g., coding and general feedback). These insights can inform content improvements, better targeting of audience needs, and refining user engagement strategies.
Trend Analysis¶
Over time, positive and neutral sentiments have shown an increasing trend, indicating that the audience's overall engagement and satisfaction with the content are improving. On the other hand, negative sentiment has remained relatively flat, without significant growth, suggesting that while issues exist, they have not worsened over time.
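This trend can be quantified. Below is a minimal sketch (assuming a DataFrame shaped like comments_df, with Published_date and Sentiment_label columns; the toy data is made up for illustration) that aggregates monthly comment counts per sentiment and fits a linear slope, where a positive slope suggests growing volume:

```python
import pandas as pd
import numpy as np

def sentiment_trend_slopes(df):
    """Fit a linear slope to monthly comment counts for each sentiment label."""
    df = df.copy()
    df['month'] = pd.to_datetime(df['Published_date']).dt.to_period('M')
    # Monthly comment count per sentiment, missing months filled with 0
    monthly = df.groupby(['month', 'Sentiment_label']).size().unstack(fill_value=0)
    slopes = {}
    for label in monthly.columns:
        x = np.arange(len(monthly))
        slopes[label] = np.polyfit(x, monthly[label].values, 1)[0]  # slope of degree-1 fit
    return slopes

# Toy example: POS comment volume grows month over month, NEG stays flat
toy = pd.DataFrame({
    'Published_date': ['2024-01-05'] + ['2024-02-05'] * 3 + ['2024-03-05'] * 5
                      + ['2024-01-10', '2024-02-10', '2024-03-10'],
    'Sentiment_label': ['POS'] * 9 + ['NEG'] * 3,
})
slopes = sentiment_trend_slopes(toy)  # POS slope ~2 comments/month, NEG slope ~0
```

On the real comments_df this would turn the visual impression from the histogram above into per-sentiment numbers.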
Next Steps¶
To further refine content strategy and improve user experience, the following steps can be taken:
Enhancing Accessibility: Since some negative feedback stems from difficulty understanding accents, providing subtitles, transcripts, or AI-generated voiceovers could improve accessibility and comprehension for a wider audience.
Reducing Disruptions: Many negative comments mention ads. Exploring ways to minimize ad interruptions—such as strategically placing them at natural breaks—could lead to improved viewer retention and satisfaction.
Encouraging Constructive Feedback: Given the predominance of positive and neutral engagement, fostering more structured feedback through polls or direct engagement with viewers could offer deeper insights into audience preferences.
Optimizing Technical Explanations: Some comments request clarification on technical topics, indicating the need for supplemental materials like written guides, coding exercises, or additional explanation videos.
The Role of Emojis in Sentiment Analysis¶
One of the most interesting findings in this analysis was the impact of emoji translation on model performance. Emojis are a fundamental part of social media language, often conveying emotions, reactions, and context that words alone may not capture. Ignoring or removing them can lead to misinterpretations of sentiment.
For example:¶
The comment "Deepseek gonna takeover everything soon 😢" transitioned from Neutral to Negative after the emoji was translated.
"Hi, I started this course about a week ago and I am having a bit of trouble with the SQL intermediate phase and would really appreciate some assistance 😢" shifted from Neutral to Negative once the emoji was translated.
Comments containing only an emoji, like "❤," transitioned from Neutral to either Positive or Negative after translation.
These examples highlight the importance of accounting for emojis in sentiment analysis. By translating emojis instead of leaving them as-is or removing them, the model correctly interprets the emotional intent behind the comments. Since social media heavily relies on emojis to express tone, sarcasm, and emotion, incorporating them into sentiment analysis is essential for accurate classification.
Future Analysis¶
To further expand this study and gain deeper insights, the following analyses could be conducted:
Multilingual Sentiment Analysis: Considering non-English comments by translating them during preprocessing or using multilingual models.
Interactive Data Exploration: Developing a Streamlit or Tableau dashboard to allow interactive sentiment and topic analysis.
Comprehensive Dataset Expansion: Extending the analysis to cover all videos and comments, including replies for deeper engagement insights.
Temporal Sentiment Trends: Analyzing sentiment shifts over time to identify patterns in audience reception and content effectiveness.
Engagement Correlation: Exploring the relationship between comment sentiment and key video engagement metrics (likes, shares, watch time) to determine what influences audience interaction.
Video-Specific Insights: Performing a per-video sentiment analysis to identify which content resonates best with the audience and why.
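As a sketch of the engagement-correlation idea: the like counts below are made up for illustration (real values would come from youtube.videos().list(part="statistics", ...)), but the pandas steps are the ones such an analysis would use:

```python
import pandas as pd

# Hypothetical per-video data: sentiment labels per comment, plus made-up like counts
comments = pd.DataFrame({
    'Video ID': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'],
    'Sentiment_label': ['POS', 'POS', 'NEG', 'NEG', 'NEU', 'POS', 'POS', 'POS', 'NEU'],
})
engagement = pd.DataFrame({'Video ID': ['a', 'b', 'c'], 'Likes': [120, 40, 300]})

# Share of positive comments per video
pos_share = (comments.assign(is_pos=comments['Sentiment_label'].eq('POS'))
             .groupby('Video ID')['is_pos'].mean().rename('pos_share'))

# Join with the engagement metric and compute the Pearson correlation
merged = engagement.merge(pos_share.reset_index(), on='Video ID')
corr = merged['pos_share'].corr(merged['Likes'])
```

With real statistics (likes, views, comment counts) this would show whether videos that attract more positive comments also attract more engagement.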
By implementing these improvements and analyses, I can further refine content strategy, enhance audience engagement, and ensure that sentiment analysis models accurately reflect the nuances of online discussions.
#HTML config
sidebar_html = """
<style>
body {
margin-left: 220px; /* Make space for the sidebar */
font-family: Arial, sans-serif;
}
.sidebar {
height: 100%;
width: 220px;
position: fixed;
left: 0;
top: 0;
background-color: #2c3e50;
padding-top: 10px;
box-shadow: 2px 0px 5px rgba(0, 0, 0, 0.2);
color: white;
transition: width 0.3s;
}
.sidebar .section {
padding: 12px 15px;
font-size: 16px;
font-weight: bold;
cursor: pointer;
background-color: #1a252f;
border-top: 1px solid #34495e;
display: flex;
align-items: center;
justify-content: space-between;
}
.sidebar .section:hover {
background-color: #34495e;
}
.sidebar .section span {
transition: transform 0.3s ease-in-out;
}
</style>
<div class="sidebar">
<div class="section" onclick="scrollToSection('overview')">Overview</div>
<div class="section" onclick="scrollToSection('api-call')">Fetch data from API</div>
<div class="section" onclick="scrollToSection('data-cleaning')">Data Cleaning</div>
<div class="section" onclick="scrollToSection('descriptive-statistics')">Descriptive Statistics</div>
<div class="section" onclick="scrollToSection('sentiment-analysis')">Sentiment analysis</div>
<div class="section" onclick="scrollToSection('topic-modeling')">Topic Modeling (BERTopic)</div>
<div class="section" onclick="scrollToSection('final-conclusion')">Final Conclusion</div>
</div>
<script>
function scrollToSection(id) {
var section = document.getElementById(id);
if (section) {
section.scrollIntoView({ behavior: "smooth" });
}
}
</script>
"""
display(HTML(sidebar_html))